NSF PAR Search | NSF Public Access Repository


Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving

Lin, Yong; Tang, Shange; Lyu, Bohan; Wu, Jiayun; Lin, Hongzhou; Yang, Kaiyu; Li, Jia; Xia, Mengzhou; Chen, Danqi; Arora, Sanjeev; et al (October 2025, Conference on language modeling)

Full Text Available
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Razin, Noam; Malladi, Sadhika; Bhaskar, Adithya; Chen, Danqi; Arora, Sanjeev; Hanin, Boris (January 2025, Proceedings of ICLR 2025)

Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counterintuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer "No" over "Never" can sharply increase the probability of "Yes". Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
Full Text Available
ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty

Wu, Xindi; Yu, Dingli; Huang, Yangsibo; Russakovsky, Olga; Arora, Sanjeev (December 2024, Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track)

Full Text Available
What Makes a Reward Model a Good Teacher? An Optimization Perspective

Razin, Noam; Wang, Zixuan; Strauss, Hubert; Wei, Stanley; Lee, Jason; Arora, Sanjeev (December 2024, NeurIPS 2025)

Full Text Available
Advancing science- and evidence-based AI policy

https://doi.org/10.1126/science.adu8449

Bommasani, Rishi; Arora, Sanjeev; Chayes, Jennifer; Choi, Yejin; Cuéllar, Mariano-Florentino; Fei-Fei, Li; Ho, Daniel E; Jurafsky, Dan; Koyejo, Sanmi; Lakkaraju, Hima; et al (July 2025, Science)

Policy must be informed by, but also facilitate the generation of, scientific evidence
Full Text Available
A Kernel-Based View of Language Model Fine-Tuning

Malladi, Sadhika; Wettig, Alexander; Yu, Dingli; Chen, Danqi; Arora, Sanjeev (July 2023, Proceedings of the 40th International Conference on Machine Learning)

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK), which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization, describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.
Do Transformers Parse while Predicting the Masked Word?

https://doi.org/10.18653/v1/2023.emnlp-main.1029

Zhao, Haoyu; Panigrahi, Abhishek; Ge, Rong; Arora, Sanjeev (January 2023, Association for Computational Linguistics)

Pre-trained language models have been shown to encode linguistic structures like parse trees in their embeddings while being trained unsupervised. Some doubts have been raised about whether the models are doing parsing or only some computation weakly correlated with it. Concretely: (a) Is it possible to explicitly describe transformers with realistic embedding dimensions, number of heads, etc. that are capable of doing parsing, or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate sizes can approximately execute the Inside-Outside algorithm for the English PCFG (Marcus et al., 1993). We also show that the Inside-Outside algorithm is optimal for masked language modeling loss on the PCFG-generated data. We conduct probing experiments on models pre-trained on PCFG-generated data to show that this allows recovery not only of an approximate parse tree but also of the marginal span probabilities computed by the Inside-Outside algorithm, which suggests an implicit bias of masked language modeling towards this algorithm.
Full Text Available
Language Models as Science Tutors

Chevalier, Alexis; Geng, Jiayi; Wettig, Alexander; Chen, Howard; Mizera, Sebastian; Annala, Toni; Aragon, Max Jameson; Rodriguez Fanlo, Arturo; Frieder, Simon; Machado, Simon; et al (May 2024, International Conference on Machine Learning)

Full Text Available
Understanding Contrastive Learning Requires Incorporating Inductive Biases

Saunshi, Nikunj; Ash, Jordan; Goel, Surbhi; Misra, Dipendra; Zhang, Cyril; Arora, Sanjeev; Kakade, Sham; Krishnamurthy, Akshay (January 2022, Proceedings of Machine Learning Research)

Full Text Available
Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

Arora, Sanjeev; Du, Simon S.; Li, Zhiyuan; Salakhutdinov, Ruslan; Wang, Ruosong; Yu, Dingli (January 2020, ICLR 2020)

Full Text Available
